Load packages and trilogy data sets

Let’s start by loading the materials we’ll need:

The main package we’ll use is the tidyverse, which is actually a collection of R packages that share a consistent design philosophy, grammar, and data structures.

To pull the current versions of the datasets, we’ll follow the steps outlined in the Getting Started > Use in Reproducible Research vignette. That’s why you’ll see a long alphanumeric code (a Git commit hash) in the links below: it specifies precisely which version of the data is being used.

library(tidyverse)

# Each URL pins the file to one specific commit of the trilogy repository
tmi <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/d57c2cefd0b216c8ce5c251f618c3e931c732d0a/data/tmi.csv")
atu_df <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/d57c2cefd0b216c8ce5c251f618c3e931c732d0a/data/atu_df.csv")
atu_seq <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/d57c2cefd0b216c8ce5c251f618c3e931c732d0a/data/atu_seq.csv")
aft <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/d57c2cefd0b216c8ce5c251f618c3e931c732d0a/data/aft.csv")

This allows us to explicitly reference a version of the data so that any research we do can be precisely replicated by others. For instance, if you wanted to pull an old (and not yet cleaned-up) version of the aft dataset, you’d just need to go back in the GitHub history and run the following:

old_aft <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/f0fb12d108734847114f17980b05686a26305e38/data/aat.csv")
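
Since every pinned link follows the same pattern, the URL can also be assembled from a commit hash and a dataset name. The helper below is only a minimal sketch: `trilogy_url()` and its argument names are hypothetical and not part of the Trilogy materials; only the URL pattern is taken from the links above.

```r
# Hypothetical helper: build the raw GitHub URL for one Trilogy CSV
# at a specific commit. Only the URL pattern comes from the links above.
trilogy_url <- function(dataset, commit) {
  sprintf(
    "https://raw.githubusercontent.com/j-hagedorn/trilogy/%s/data/%s.csv",
    commit, dataset
  )
}

# The commit hash used for the current versions of the datasets
pinned <- "d57c2cefd0b216c8ce5c251f618c3e931c732d0a"

tmi_url <- trilogy_url("tmi", pinned)
tmi_url
```

The resulting string can be passed to read_csv() exactly like the literal links, which keeps the commit hash in one place if several datasets are loaded.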

Overview

Summary Stats

The table below shows the distinct count of motifs and tale types in Trilogy datasets.

x <- 
  atu_seq %>% 
  # Get all tale IDs and motif IDs within them
  select(motif,atu_id) %>% 
  mutate(in_atu_seq = T) %>%
  # Pull in all tale IDs present in annotated tales.
  # Note that counting motifs in annotated tales assumes 
  # that each tale has each motif in the canonical sequence.
  left_join(
    aft %>% select(atu_id) %>% mutate(in_aft = T),
    by = "atu_id"
  ) %>%
  distinct(motif, in_atu_seq, in_aft) %>%
  mutate(in_aft = if_else(is.na(in_aft),F,T)) %>%
  # Don't keep multiple rows per motif, count as present (i.e. 'TRUE')
  group_by(motif, in_atu_seq) %>%
  filter(in_aft == max(in_aft)) %>%
  # Drop the grouping so later steps work on a plain tibble
  ungroup()

m <-
  tmi %>%
  select(motif = id) %>%
  distinct(motif) %>%
  mutate(in_tmi = T) %>%
  full_join(x, by = "motif") %>%
  mutate(
    in_tmi =     if_else(is.na(in_tmi),F,in_tmi),
    in_atu_seq = if_else(is.na(in_atu_seq),F,in_atu_seq),
    in_aft =     if_else(is.na(in_aft),F,in_aft)
  ) 

m_sum <-
  m %>%
  summarise(
    in_tmi =     sum(in_tmi),
    in_atu_seq = sum(in_atu_seq),
    in_aft =     sum(in_aft)
  ) %>%
  mutate(unit = "motifs")
  
t <-
  atu_df %>%
  distinct(atu_id) %>%
  mutate(in_atu_df = T) %>%
  left_join(
    atu_seq %>% 
      distinct(atu_id) %>%
      mutate(in_atu_seq = T),
    by = "atu_id"
  ) %>%
  left_join(
    aft %>%
      distinct(atu_id) %>%
      mutate(in_aft = T),
    by = "atu_id"
  ) %>%
  mutate(
    in_atu_seq = if_else(is.na(in_atu_seq),F,in_atu_seq),
    in_aft = if_else(is.na(in_aft),F,in_aft)
  ) 

t_sum <-
  t %>%
  summarise(
    in_atu_df =  sum(in_atu_df),
    in_atu_seq = sum(in_atu_seq),
    in_aft =     sum(in_aft)
  ) %>%
  mutate(unit = "tale types")
  
m_sum %>%
  bind_rows(t_sum) %>%
  select(
    unit, tmi = in_tmi, 
    atu_df = in_atu_df, atu_seq = in_atu_seq, aft = in_aft
  ) %>%
  rmarkdown::paged_table()
rm(m); rm(m_sum); rm(t); rm(t_sum)

Note that motifs do not exist in atu_df or atu_combos, because those datasets contain one row per atu_id. Similarly, tale types (atu_ids) do not exist in tmi, because that dataset contains one row per motif.

The Trilogy’s datasets are linked by two key identifiers: motifs (i.e. motif_id) and tale types (i.e. atu_id). Using them successfully requires understanding how the datasets overlap, and what proportion of the available motifs and tale types each one covers.

Motif Intersections

Below is an upset plot showing the intersection of discrete motifs across the various datasets which comprise the Trilogy. For any one dataset, adding up the bars above every column that has a dot in that dataset’s row should give the count of motifs shown in the Summary Stats tab. For instance, the aft contains 830 motifs (826 + 4).

library(UpSetR)

motif_lists <- 
  list(
    tmi = tmi %>% distinct(id) %>% .$id, 
    atu_seq = atu_seq %>% distinct(motif) %>% .$motif, 
    aft = x %>% filter(in_aft) %>% .$motif
  )

upset(
  fromList(motif_lists), 
  order.by = "freq",
  mainbar.y.label = "Intersection Size", 
  sets.x.label = "Motifs per Dataset"
)

rm(x); rm(motif_lists)

unjoined_motifs <-
  atu_seq %>% 
  select(motif) %>%
  distinct() %>%
  anti_join(tmi, by = c("motif" = "id"))
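
The bar-summing rule for reading the upset plot can be checked on a toy example. The sketch below uses base R only; the three small sets and their IDs are invented for illustration and are not drawn from the Trilogy data.

```r
# Toy sets (invented IDs), standing in for the three motif lists
motif_sets <- list(
  tmi     = c("A1", "A2", "A3", "B1"),
  atu_seq = c("A2", "A3", "Z9"),
  aft     = c("A3", "Z9")
)

# One row per distinct ID, one logical column per dataset
all_ids <- sort(unique(unlist(motif_sets)))
membership <- sapply(motif_sets, function(s) all_ids %in% s)
rownames(membership) <- all_ids

# Each distinct membership pattern corresponds to one column of dots
pattern <- apply(
  membership, 1,
  function(r) paste(names(motif_sets)[r], collapse = " & ")
)
bars <- table(pattern)
bars

# Summing the bars whose pattern includes "aft" recovers the size of aft
aft_total <- sum(bars[grepl("aft", names(bars))])
aft_total == length(motif_sets$aft)
```

Each ID falls into exactly one pattern, so the bars never double count; this is why 826 + 4 gives the full count of motifs for the aft.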

Observations:

  • The majority of motifs from the tmi are not present in the ATU (i.e. atu_seq). Specifically, 42,457 of the 46,222 motifs in the tmi (91.9%) are not present in the ATU. This means that the tale types from the ATU make use of only 8.1% of the available motifs from the TMI.
  • Of the motifs which are present in the ATU (atu_seq), most (n = 2,969 of 3,799) do not have corresponding annotated texts in the aft.
  • There are 826 motifs from the tmi present among the tale types represented in the aft corpus. This is a minuscule 1.8% of the total available motifs in the tmi. Fortunately, it can be increased with time and dedication.
  • There are a small number of odd instances (n = 34, or 30 + 4) where a motif ID is present in atu_seq but not in the tmi. In 4 of these instances, at least one corresponding tale text in the aft contains a non-tmi motif.[1]
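
The figures in these observations fit together arithmetically. The sketch below simply re-derives them from the counts quoted above; the variable names are illustrative, and it relies on the fact that aft motifs are counted via atu_seq in the code that builds the plot.

```r
# Counts quoted in the observations above
n_tmi         <- 46222  # distinct motifs in the tmi
n_tmi_not_atu <- 42457  # tmi motifs absent from atu_seq
n_seq_not_tmi <- 34     # atu_seq motif IDs absent from the tmi (30 + 4)
n_aft_in_tmi  <- 826    # aft motifs that are also in the tmi
n_aft_not_tmi <- 4      # aft motifs that are not in the tmi

# Motifs in atu_seq: tmi motifs that are used, plus the unmatched IDs
n_atu_seq <- (n_tmi - n_tmi_not_atu) + n_seq_not_tmi
# Motifs in aft: those matched to the tmi, plus those that are not
n_aft <- n_aft_in_tmi + n_aft_not_tmi
# atu_seq motifs with no annotated text
n_seq_no_text <- n_atu_seq - n_aft

pct_unused <- round(100 * n_tmi_not_atu / n_tmi, 1)
pct_aft    <- round(100 * n_aft_in_tmi / n_tmi, 1)

c(n_atu_seq = n_atu_seq, n_aft = n_aft, n_seq_no_text = n_seq_no_text)
c(pct_unused = pct_unused, pct_aft = pct_aft)
```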

Tale Type Intersections

Below is an upset plot showing the intersection of discrete tale types across the various datasets which comprise the Trilogy. A quick clarification may be helpful: atu_df contains all tale types, while atu_seq contains only tale types with motifs identified. Additional documentation regarding each of these can be found in the data dictionary.

type_lists <- 
  list(
    atu_df = atu_df %>% distinct(atu_id) %>% .$atu_id, 
    atu_seq = atu_seq %>% distinct(atu_id) %>% .$atu_id, 
    aft = aft %>% distinct(atu_id) %>% .$atu_id
  )
upset(
  fromList(type_lists), 
  order.by = "freq",
  mainbar.y.label = "Intersection Size", 
  sets.x.label = "Tale Types per Dataset"
)

rm(unjoined_motifs); rm(type_lists)

Observations:

  • Not all atu_ids in atu_df are in atu_seq, because some of the tale summaries from atu_df do not reference any motif IDs.[2] Specifically, 597 tale types in the ATU do not have distinct motif IDs identified.
  • This means that an atu_id can be present in atu_df and in aft, but not be included in atu_seq, which means the text version of the tale cannot be referenced against an available list of motifs. Fortunately, this only applies to 8 tale types.
  • There are 174 tale types which occur across all three datasets: atu_df, atu_seq, and aft.
  • Most of the tale types from the ATU (atu_df and atu_seq) are not present in the aft. This underscores the need for more annotated texts.
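
These counts can be reconciled with the totals reported elsewhere in this document (2247 tale types in atu_df, 1642 in atu_seq). The sketch below is a rough check under two assumptions that the text does not state outright: that every tale type in atu_seq and aft also appears in atu_df, and that the 597 figure counts tale types found in atu_df alone.

```r
n_atu_df      <- 2247  # tale types in atu_df
n_df_only     <- 597   # assumed: in atu_df only
n_df_aft_only <- 8     # in atu_df and aft, but not atu_seq
n_all_three   <- 174   # in atu_df, atu_seq, and aft

# Tale types with motif sequences, i.e. present in atu_seq
n_atu_seq <- n_atu_df - n_df_only - n_df_aft_only
# atu_seq tale types without any annotated text
n_seq_without_text <- n_atu_seq - n_all_three
# Tale types represented in aft
n_aft <- n_all_three + n_df_aft_only

c(n_atu_seq = n_atu_seq, n_seq_without_text = n_seq_without_text, n_aft = n_aft)

# Share of atu_seq tale types that still lack an annotated text
pct_without_text <- round(100 * n_seq_without_text / n_atu_seq, 1)
pct_without_text
```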

Exercise:


Motifs

The tmi is composed of 46222 distinct motifs.[3] It is grouped into 23 ‘chapters’: Myths, Animals, Tabu, Magic, Death, Marvels, Ogres, Tests, Wisdom and Folly, Deceptions, Reversals of Fortune, Ordaining the Future, Chance and Fate, Society, Rewards and Punishments, Captives and Fugitives, Cruelty, Sex, Nature of Life, Religion, Traits of Character, Humor, Miscellaneous. Beneath the chapter level are nested levels of groups, named as follows:

Distribution by level

The most populated level of the index (i.e. ‘3’) is that of the initial subdivision, which indicates that motifs are frequently not split any further. While the index structure would allow each subsequent level (i.e. levels 4 - 6+) to hold increasing numbers of more finely grained motifs, these either do not exist or have not been filled in.

tmi %>%
  ggplot(aes(x = level)) +
  geom_bar() +  # geom_bar() counts rows per x value by default
  theme_minimal() + 
  theme(plot.title.position = "plot") +
  labs(
    title = "Most motifs are in the subdivisions",
    subtitle = "From chapters (level 0) through subdivisions (levels 3-6)",
    x = "Depth within index",
    y = "Distinct motif entries"
  )


Flat format

The excerpt below shows how this hierarchy structure is represented in the ‘flat’ dataset:

tmi %>%
  filter(level_2 == "B122") %>%
  select(chapter_id,id,motif_name,level,starts_with("level_")) %>%
  select(-level_5,-level_6) %>%
  arrange(id) %>%
  rmarkdown::paged_table()

Note that some motifs do not fit into the hierarchical level format (i.e. level = NA). This occurs when there is a zero indicator at one of the decimal indices, since this creates a break in the hierarchical structure. For instance, in the “B122” section, we find B122.1 as a parent motif for B122.1.1-2, but there is no B122.0 to serve as a parent for B122.0.1.
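
The break described above can be reproduced on a handful of IDs. The sketch below is illustrative only: `parent_id()` is a hypothetical helper that treats the parent as the ID minus its last decimal index, which is an assumption and not necessarily how the level columns in the tmi were derived.

```r
# Toy subset of motif IDs, taken from the "B122" example above
ids <- c("B122", "B122.1", "B122.1.1", "B122.1.2", "B122.0.1")

# Hypothetical helper: drop the last decimal index to get the parent ID
# (an ID with no decimal index is returned unchanged)
parent_id <- function(id) sub("\\.[0-9]+$", "", id)

parents <- parent_id(ids)
has_parent <- parents %in% ids

data.frame(id = ids, parent = parents, has_parent = has_parent)

# IDs whose expected parent is missing from the index
orphans <- ids[!has_parent & parents != ids]
orphans
```

Here B122.0.1 is the only orphan, because its expected parent B122.0 is not an entry in the index.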


Summary

The table below shows how many motifs (i.e. n_motifs) are in each chapter, and each level (if you expand the row).

library(reactable)
tmi %>%
  group_by(chapter_name,level) %>%
  summarize(n_motifs = n_distinct(id)) %>%
  reactable(
    groupBy = c("chapter_name"),
    columns = list(
      level = colDef(
        aggregate = "count",
        format = list(
          aggregated = colFormat(suffix = " levels")
        )
      ),
      n_motifs = colDef(aggregate = "sum")
    )
  )
summary_df <-
  tmi %>%
  group_by(chapter_id,chapter_name,level_0,level_1,level_2,level_3) %>%
  summarize(n = n()) %>% 
  ungroup() %>%
  left_join(tmi %>% select(id,level_0_name = motif_name), by = c("level_0" = "id")) %>%
  left_join(tmi %>% select(id,level_1_name = motif_name), by = c("level_1" = "id")) %>%
  left_join(tmi %>% select(id,level_2_name = motif_name), by = c("level_2" = "id")) %>%
  left_join(tmi %>% select(id,level_3_name = motif_name), by = c("level_3" = "id")) %>%
  select(
    starts_with("chapter_"),starts_with("level_0"),
    starts_with("level_1"),starts_with("level_2"),starts_with("level_3"),n
  )

Tale Types

Tale types are derived from the Aarne–Thompson–Uther Index (ATU). It is represented in the Trilogy by three distinct datasets: atu_df, atu_seq, and atu_combos. Additional documentation regarding each of these can be found in the data dictionary.

Types and Descriptions

The atu_df is composed of 2247 distinct tale types, each with a formal identifier (atu_id). It is grouped into 7 chapters, which are broken into sub-sections, or divisions. There are 42 divisions in the index. You can click on the treemap below to explore each chapter and its divisions and subdivisions:

stack <-
  atu_df %>%
  group_by(chapter) %>%
  summarize(n = n()) %>% ungroup() %>%
  mutate(parent = "") %>% rename(child = chapter) %>% select(parent,child,n) %>%
  bind_rows(
    atu_df %>% select(parent = chapter, child = division) %>%
      group_by(parent,child) %>% summarize(n = n()) %>% ungroup()
  ) %>%
  bind_rows(
    atu_df %>% select(parent = division, child = sub_division) %>%
      group_by(parent,child) %>% summarize(n = n()) %>% ungroup()
  ) %>%
  ungroup() %>%
  filter(!is.na(child)) %>%
  mutate(id = str_c("r_",row_number()))

plotly::plot_ly(
  type = "treemap", # "sunburst"
  branchvalues = "total",
  labels = stack$child,
  values = stack$n,
  parents = stack$parent
)
rm(stack)

Motif Sequences

summary_atu_seq <- 
  atu_seq %>%
  left_join(
    aft %>% distinct(atu_id) %>% mutate(in_aft = T),
    by = "atu_id"
  ) %>%
  mutate(in_aft = if_else(is.na(in_aft),F,in_aft)) %>%
  group_by(atu_id,in_aft) %>%
  summarize(
    variants = max(tale_variant),
    n_motifs = n_distinct(motif)
  ) %>%
  ungroup() 

The atu_seq dataset has one row for each occurrence of a TMI motif within a tale type from the ATU index. It was produced by pulling motif IDs from the tale_type description in the atu_df dataset. On average, there are 41.6 tale variants[4] and 2.8 motifs per tale type, with the distributions shown below:

library(patchwork)

p1 <-
  summary_atu_seq %>%
  slice_min(order_by = variants, prop = 0.99) %>%
  ggplot(aes(x = variants)) +
  geom_bar() +  # geom_bar() counts rows per x value by default
  theme_minimal() 

p2 <-
  summary_atu_seq %>%
  ggplot(aes(x = n_motifs)) +
  geom_bar() +  # geom_bar() counts rows per x value by default
  theme_minimal() 

p <- p1 / p2

p + plot_annotation(
  title = 'How many stories? How many parts?',
  subtitle = 'Distribution of variant and motif counts within the ATU',
  caption = 'NB: Outliers removed from variants'
)

rm(p); rm(p1); rm(p2)

The table below shows the average number of tale variants and the average number of motifs per tale type, in both the atu_seq and aft.

summary_atu_seq %>%
  summarize(
    total_variants = sum(variants, na.rm = T),
    avg_variants = round(mean(variants, na.rm = T), digits = 1),
    med_variants = round(median(variants, na.rm = T), digits = 1),
    avg_motifs = round(mean(n_motifs, na.rm = T), digits = 1)
  ) %>%
  mutate(dataset = "atu_seq") %>%
  select(dataset,total_variants,avg_variants,med_variants,avg_motifs) %>%
  bind_rows(
    summary_atu_seq %>%
      filter(in_aft == T) %>%
      summarize(
        total_variants = sum(variants, na.rm = T),
        avg_variants = round(mean(variants, na.rm = T), digits = 1),
        med_variants = round(median(variants, na.rm = T), digits = 1),
        avg_motifs = round(mean(n_motifs, na.rm = T), digits = 1)
      ) %>%
      mutate(dataset = "aft") %>%
      select(dataset,total_variants,avg_variants,med_variants,avg_motifs)
  ) %>%
  rmarkdown::paged_table()

Tale Variants

For some tales, multiple combinations of motifs are noted as possible permutations of the tale. For example, ATU 605A is a story in which “A young man, born of an animal… or from a giant… [B631, F611.1.1, F611.1.11-F611.1.15, T516] develops great strength (at the forge, in the forest, in war, by suckling for many years [F611.2.1, F611.2.3]…)”. In these instances, all of the possible permutations are listed as specific variants of the tale type. When ranges of motifs are referenced (e.g. F611.1.11-F611.1.15, above), all motifs within that range are included, each in a different variant.[5]
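
To make the idea concrete, the sketch below expands a motif range and counts the combinations that result when each bracketed group contributes one motif per variant. It is a toy illustration, not the code used to build atu_seq: `expand_motif_range()` is a hypothetical helper, it assumes both ends of a range share the same stem and differ only in the last decimal index, and it does not check whether each generated ID exists in the tmi. Treating each bracket as one slot with alternatives is likewise an assumption.

```r
# Hypothetical helper: expand "F611.1.11-F611.1.15" into its member IDs
expand_motif_range <- function(range) {
  ends <- strsplit(range, "-", fixed = TRUE)[[1]]
  from <- strsplit(ends[1], ".", fixed = TRUE)[[1]]
  to   <- strsplit(ends[2], ".", fixed = TRUE)[[1]]
  n <- length(from)
  stem <- paste(from[-n], collapse = ".")
  last <- seq(as.integer(from[n]), as.integer(to[n]))
  paste(stem, last, sep = ".")
}

range_ids <- expand_motif_range("F611.1.11-F611.1.15")
range_ids

# Assumed reading: each bracket is one slot, filled by exactly one motif
slots <- list(
  origin   = c("B631", "F611.1.1", range_ids, "T516"),
  strength = c("F611.2.1", "F611.2.3")
)

# Every combination of one motif per slot is a candidate variant
variants <- expand.grid(slots, stringsAsFactors = FALSE)
nrow(variants)
```

Because the number of combinations is the product of the slot sizes, a tale summary with several long lists or ranges can produce a very large number of variants.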

The table below shows the tale types with the top 1% of variants derived from the method described above:

summary_atu_seq %>%
  slice_max(order_by = variants, prop = 0.01) %>%
  rmarkdown::paged_table()

Here, we see that 61,884 (i.e. 46,332 + 15,552) tale variants out of the total 68,245 are due to 2 extreme outliers caused by a combinatorial explosion. Without these included, there would be 6,361 tale variants, derived from a total of 1,642 tale types in the atu_seq dataset.
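
The effect of the two outliers on the average can be worked through directly from the figures above. The sketch below assumes that the 1642 tale types include the two outliers, so the trimmed mean is taken over the remaining 1640; the variable names are illustrative.

```r
total_variants   <- 68245            # all tale variants in atu_seq
outlier_variants <- c(46332, 15552)  # the 2 extreme outliers
n_types          <- 1642             # tale types in atu_seq

# Variants left once the outliers are set aside
remaining <- total_variants - sum(outlier_variants)

# Mean variants per tale type, with and without the outliers
mean_all  <- total_variants / n_types
mean_trim <- remaining / (n_types - length(outlier_variants))

remaining
round(mean_all, 1)
round(mean_trim, 1)
```

Under that assumption, the mean drops from roughly 41.6 to roughly 3.9 variants per tale type, which is why the median is reported alongside the mean in the summary table.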


Annotated Folktales



  1. The unmatched motif IDs include: H1014, F111.7, X751, X761, K461.2.1, J115.4, J1744.4, C111.10, B478, X1122.3.1, X208.2, L452.1.7, T721.5, T101, D963, D590, F605.2, Z3, B581.1.2, D371.1, N552.1.1, F661.1.1, K542.1, K2135.1, A221.3, D16102.2, W245, H161.1, K19.5.3, D638.1, R195, J941, R131.1.3, T74.0.1. One can inspect these individually and compare them to the tmi and find that there are similar motifs that don’t quite match. For instance, there is no “F661.1.1”, but there is a “F661.11”, or “Skillful Archer Uses Arrow As Boomerang”.

  2. Motifs are extracted from the tale summaries present in the ATU using the code here. An example of a tale summary without identified motif IDs is ATU 1342: “During a cold winter, a satyr (wood spirit) meets a man (boy) who is cold and accommodates him in his cave. The satyr watches the man blowing in his hand and is told that in this way he wants to warm his numb fingers. When the satyr serves up a meal, his guest blows on the food and explains that he wants to cool it. The satyr is afraid of this strange human behavior, blowing hot and cold in the same manner, and chases the man away.”

  3. Recall that Yarlott and Finlayson (2016) counted “46,248 motifs and sub-motifs, 41,796 of which have references to tales or tale types.” While there is a difference in the total count of motifs, it is minimal, and it is unclear what they mean by sub-motifs.

  4. With some extreme outliers, that is.

  5. Note that when the suffix “ff.” (i.e. and following) is appended to a motif, we do not include all motifs following it, since it is unclear precisely what is intended by this convention.